mixed image
TdAttenMix: Top-Down Attention Guided Mixup
Wang, Zhiming, Gu, Lin, Lu, Feng
CutMix is a data augmentation strategy that cuts and pastes image patches to mix training data. Existing methods pick either random or salient areas, which are often inconsistent with the labels and thus misguide model training. To our knowledge, we are the first to integrate human gaze to guide CutMix. Since human attention is driven by both high-level recognition and low-level cues, we propose a controllable Top-down Attention Guided Module to obtain a general artificial attention that balances top-down and bottom-up attention. The proposed TdAttenMix then picks the patches and adjusts the label mixing ratio so that both focus on regions relevant to the current label. Experimental results demonstrate that our TdAttenMix outperforms existing state-of-the-art mixup methods across eight different benchmarks. Additionally, we introduce a new metric based on human gaze and use this metric to investigate the issue of image-label inconsistency. Project page: https://github.com/morning12138/TdAttenMix
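For orientation, the baseline cut-and-paste operation that this abstract builds on can be sketched as follows. This is a minimal NumPy sketch of plain random-box CutMix, not of the attention-guided patch selection proposed in the paper; the function name `cutmix`, the square-root box sizing, and the area-based label ratio are assumptions taken from the commonly used CutMix recipe.

```python
import numpy as np

def cutmix(img_a, img_b, label_a, label_b, lam, rng):
    """Paste a random box from img_b into img_a and mix labels by area.

    Assumption: box side lengths scale with sqrt(1 - lam), as in the
    commonly used random-box CutMix recipe.
    """
    h, w = img_a.shape[:2]
    cut_h = int(h * np.sqrt(1.0 - lam))
    cut_w = int(w * np.sqrt(1.0 - lam))
    # Random box centre; the box is clipped at the image border.
    cy = int(rng.integers(0, h))
    cx = int(rng.integers(0, w))
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    mixed = img_a.copy()
    mixed[y1:y2, x1:x2] = img_b[y1:y2, x1:x2]
    # Recompute the ratio from the pasted area actually used after clipping.
    lam_adj = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)
    mixed_label = lam_adj * label_a + (1.0 - lam_adj) * label_b
    return mixed, mixed_label

rng = np.random.default_rng(0)
img_a = np.zeros((32, 32, 3), dtype=np.float32)   # stands in for image A
img_b = np.ones((32, 32, 3), dtype=np.float32)    # stands in for image B
label_a = np.array([1.0, 0.0])
label_b = np.array([0.0, 1.0])
mixed, mixed_label = cutmix(img_a, img_b, label_a, label_b, lam=0.6, rng=rng)
```

Here the label ratio depends only on the pasted area, regardless of what the patch contains. That is the image-label inconsistency the abstract describes: a pasted patch of background still contributes its full area to the label.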
TransformMix: Learning Transformation and Mixing Strategies from Data
Cheung, Tsz-Him, Yeung, Dit-Yan
Data augmentation improves the generalization power of deep learning models by synthesizing more training samples. Sample-mixing is a popular data augmentation approach that creates additional data by combining existing samples. Recent sample-mixing methods, like Mixup and Cutmix, adopt simple mixing operations to blend multiple inputs. Although such a heuristic approach shows certain performance gains in some computer vision tasks, it mixes the images blindly and does not adapt to different datasets automatically. A mixing strategy that is effective for a particular dataset does not often generalize well to other datasets. If not properly configured, the methods may create misleading mixed images, which jeopardize the effectiveness of sample-mixing augmentations. In this work, we propose an automated approach, TransformMix, to learn better transformation and mixing augmentation strategies from data. In particular, TransformMix applies learned transformations and mixing masks to create compelling mixed images that contain correct and important information for the target tasks. We demonstrate the effectiveness of TransformMix on multiple datasets in transfer learning, classification, object detection, and knowledge distillation settings. Experimental results show that our method achieves better performance as well as efficiency when compared with strong sample-mixing baselines.
SpliceMix: A Cross-scale and Semantic Blending Augmentation Strategy for Multi-label Image Classification
Wang, Lei, Zhan, Yibing, Ma, Leilei, Tao, Dapeng, Ding, Liang, Gong, Chen
Recently, Mix-style data augmentation methods (e.g., Mixup and CutMix) have shown promising performance in various visual tasks. However, these methods are primarily designed for single-label images and ignore the considerable discrepancies between single- and multi-label images, i.e., a multi-label image involves multiple co-occurring categories and widely varying object scales. On the other hand, previous multi-label image classification (MLIC) methods tend to design elaborate models, which incur expensive computation. In this paper, we introduce a simple but effective augmentation strategy for multi-label image classification, namely SpliceMix. The "splice" in our method is two-fold: 1) Each mixed image is a splice of several downsampled images in the form of a grid, where the semantics of the images participating in the mixing are blended without object deficiencies, alleviating co-occurrence bias; 2) We splice mixed images and the original mini-batch to form a new SpliceMixed mini-batch, which allows an image at different scales to contribute to training together. Furthermore, this splice in our SpliceMixed mini-batch enables interactions between mixed images and original regular images. We also offer a simple and non-parametric extension based on consistency learning (SpliceMix-CL) to show the flexible extensibility of our SpliceMix. Extensive experiments on various tasks demonstrate that using only SpliceMix with a baseline model (e.g., ResNet) achieves better performance than state-of-the-art methods. Moreover, the generalizability of our SpliceMix is further validated by the improvements in current MLIC methods when combined with our SpliceMix. The code is available at https://github.com/zuiran/SpliceMix.
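The two-fold splice described above can be illustrated with a small NumPy sketch. It is an illustration under stated assumptions rather than the released implementation: the helper name `splice_grid`, strided subsampling as the downsampling step, a fixed 2x2 grid, and taking the union of the multi-hot labels as the mixed label are all choices made here for brevity.

```python
import numpy as np

def splice_grid(images, labels, rows=2, cols=2):
    """Splice rows*cols downsampled images into one grid image.

    images: (n, h, w, c) with n == rows * cols
    labels: (n, num_classes) multi-hot
    Assumption: downsampling is plain strided subsampling, and the mixed
    label is the union (element-wise max) of the source labels.
    """
    n, h, w, c = images.shape
    assert n == rows * cols
    assert h % rows == 0 and w % cols == 0
    small = images[:, ::rows, ::cols, :]          # (n, h/rows, w/cols, c)
    sh, sw = h // rows, w // cols
    mixed = np.zeros((h, w, c), dtype=images.dtype)
    for i in range(n):
        r, col = divmod(i, cols)
        mixed[r * sh:(r + 1) * sh, col * sw:(col + 1) * sw] = small[i]
    mixed_label = labels.max(axis=0)
    return mixed, mixed_label

# Four constant-valued toy images, each tagged with one of five classes.
images = np.stack([np.full((8, 8, 3), i, dtype=np.float32) for i in range(4)])
labels = np.eye(4, 5, dtype=np.float32)
mixed, mixed_label = splice_grid(images, labels)

# Second splice: append the mixed image to the original mini-batch.
batch = np.concatenate([images, mixed[None]], axis=0)
batch_labels = np.concatenate([labels, mixed_label[None]], axis=0)
```

Because every source image is kept whole, only smaller, no object is cropped away, and each image appears in the batch at two scales.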
Human-in-the-Loop Mixup
Collins, Katherine M., Bhatt, Umang, Liu, Weiyang, Piratla, Vihari, Sucholutsky, Ilia, Love, Bradley, Weller, Adrian
Aligning model representations to humans has been found to improve robustness and generalization. However, such methods often focus on standard observational data. Synthetic data is proliferating and powering many advances in machine learning; yet, it is not always clear whether synthetic labels are perceptually aligned to humans -- rendering it likely model representations are not human aligned. We focus on the synthetic data used in mixup: a powerful regularizer shown to improve model robustness, generalization, and calibration. We design a comprehensive series of elicitation interfaces, which we release as HILL MixE Suite, and recruit 159 participants to provide perceptual judgments along with their uncertainties, over mixup examples. We find that human perceptions do not consistently align with the labels traditionally used for synthetic points, and begin to demonstrate the applicability of these findings to potentially increase the reliability of downstream models, particularly when incorporating human uncertainty. We release all elicited judgments in a new data hub we call H-Mix.
Token-Label Alignment for Vision Transformers
Xiao, Han, Zheng, Wenzhao, Zhu, Zheng, Zhou, Jie, Lu, Jiwen
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs). They mix two images as inputs for training and assign them a mixed label with the same ratio. While they have been shown to be effective for vision transformers (ViTs), we identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies. We empirically observe that the contributions of input tokens fluctuate during forward propagation, which might induce a different mixing ratio in the output tokens. The training target computed by the original data mixing strategy can thus be inaccurate, resulting in less effective training. To address this, we propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token. We reuse the computed attention at each layer for efficient token-label alignment, introducing only negligible additional training costs. Extensive experiments demonstrate that our method improves the performance of ViTs on image classification, semantic segmentation, object detection, and transfer learning tasks. Code is available at: https://github.com/Euphoria16/TL-Align.
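The idea of maintaining a label for each token and reusing the attention to update it can be sketched as below. This is a simplified illustration, not the TL-Align implementation: it assumes a single attention head, ignores residual connections and MLP blocks, and the names `softmax` and `align_token_labels` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def align_token_labels(token_labels, attn):
    """Propagate per-token labels with the same weights that mix the tokens.

    token_labels: (num_tokens, num_classes), each row sums to 1
    attn:         (num_tokens, num_tokens), each row sums to 1
    Assumption: one head, no residual path.
    """
    return attn @ token_labels

rng = np.random.default_rng(0)
# Tokens 0-1 come from image A (class 0), tokens 2-3 from image B (class 1).
token_labels = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
attn = softmax(rng.normal(size=(4, 4)))          # stand-in for one layer's attention
aligned = align_token_labels(token_labels, attn)
image_label = aligned.mean(axis=0)               # assumed pooling: token average
```

Each aligned row remains a valid label distribution because the attention rows sum to one, yet the pooled `image_label` generally drifts away from the input ratio of 0.5/0.5. That drift is the fluctuation the abstract refers to.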
Learning Generative Models of Structured Signals from Their Superposition Using GANs with Application to Denoising and Demixing
Soltani, Mohammadreza, Jain, Swayambhoo, Sambasivan, Abhinav
In general, the separation problem is inherently ill-posed; however, with sufficient structural assumptions on X and N, it has been established that separation is possible. Depending on the application, one might be interested in estimating only X (in this case, N is considered the corruption), which is referred to as denoising, or in recovering both X and N, which is referred to as demixing. Both demixing and denoising arise in a variety of important practical applications in the areas of signal/image processing, computer vision, machine learning, and statistics [Chen et al., 2001, Elad et al., 2005, Bobin et al., 2007, Candès et al., 2011]. Most existing techniques assume some prior knowledge of the structures of X and N in order to recover the desired component signal(s). Such prior knowledge can only be obtained if one has access to the generative mechanism of the signals or to clean samples from the probability distribution defined over the sets X and N. In many practical settings, neither of these may be feasible. In this paper, we consider the problem of separating constituent signals from superposed observations when clean samples from the distribution are not available.
Between-class Learning for Image Classification
Tokozume, Yuji, Ushiku, Yoshitaka, Harada, Tatsuya
In this paper, we propose a novel learning method for image classification called Between-Class learning (BC learning). We generate between-class images by mixing two images belonging to different classes with a random ratio. We then input the mixed image to the model and train the model to output the mixing ratio. BC learning has the ability to impose a constraint on the shape of the feature distributions, and thus the generalization ability is improved. BC learning is originally a method developed for sounds, which can be digitally mixed. Mixing two image data does not appear to make sense; however, we argue that because convolutional neural networks have an aspect of treating input data as waveforms, what works on sounds must also work on images. First, we propose a simple mixing method using internal divisions, which surprisingly proves to significantly improve performance. Second, we propose a mixing method that treats the images as waveforms, which leads to a further improvement in performance. As a result, we achieved 19.4% and 2.26% top-1 errors on ImageNet-1K and CIFAR-10, respectively.
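The first, simple mixing method (internal division with a random ratio, then training the model to output that ratio) can be sketched as follows. This is a minimal NumPy sketch; the function names are hypothetical, and using a KL divergence between the ratio label and the model's softmax output as the training loss is an assumption of this sketch, not something stated in the abstract.

```python
import numpy as np

def bc_mix(x1, x2, t1, t2, r):
    """Mix two images and their one-hot labels by internal division."""
    mixed_x = r * x1 + (1.0 - r) * x2
    mixed_t = r * t1 + (1.0 - r) * t2
    return mixed_x, mixed_t

def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred), summed over classes with non-zero target mass."""
    mask = target > 0
    return float(np.sum(target[mask] * np.log(target[mask] / (pred[mask] + eps))))

x1 = np.zeros((4, 4, 3))            # toy image of class 0
x2 = np.ones((4, 4, 3))             # toy image of class 1
t1 = np.array([1.0, 0.0])
t2 = np.array([0.0, 1.0])
r = 0.3                             # in training this would be sampled at random
mixed_x, mixed_t = bc_mix(x1, x2, t1, t2, r)

perfect = kl_divergence(mixed_t, mixed_t)              # model outputs the ratio exactly
wrong = kl_divergence(mixed_t, np.array([0.9, 0.1]))   # model is confidently off
```

The loss is near zero only when the predicted class distribution matches the mixing ratio, which is the sense in which the model is trained "to output the mixing ratio".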